Checkpointing techniques make programs fault-tolerant by saving their state 
periodically and restoring this state after failure. System-level checkpointing saves 
the state of the entire machine on stable storage, but this usually has too much 
overhead. In practice, programmers do manual checkpointing by writing code to (i) 
save the values of key program variables at critical points in the program, and (ii) 
restore ... 

Understanding distributed applications is a tedious and difficult task. Visualizations 
based on process-time diagrams are often used to obtain a better understanding of 
the execution of the application. The visualization tool we use is Poet, an event 
tracer developed at the University of Waterloo. However, these diagrams are often 
very complex and do not provide the user with the desired overview of the 
application. In our experience, such tools display repeated occurrences of 
non-trivial commun ... 

The Advanced Automation System is a distributed real-time system under 
development by IBM's Systems Integration Division for the US Federal Aviation 
Administration. The system is intended to replace the present en-route and 
terminal approach US air traffic control computer systems over the next decade. 
High availability of air traffic control services is an essential requirement of the 
system. This paper discusses the general approach to fault-tolerance adopted in 
AAS, by reviewing some of the q ... 

Group communication services are becoming accepted as effective building blocks 
for the construction of fault-tolerant distributed applications. Many specifications for 
group communication services have been proposed. However, there is still no 
agreement about what these specifications should say, especially in cases where 
the services are partitionable, i.e., where communication failures may lead to 
simultaneous creation of groups with disjoint memberships, such that each ... 

The distributed real-time system services developed by Lockheed Martin's Air Traffic 
Management group serve as the infrastructure for a number of air traffic control 
systems. Systems either completed or under development include the US Federal 
Aviation Administration's Display System Replacement (DSR) system, the UK Civil 
Aviation Authority's New Enroute Center (NERC) system, and the Republic of 
China's Air Traffic Control Automated System (ATCAS). These systems are intended 
to replace present ... 

Keywords: exception handling, failure, failure classification, failure masking, failure 

Ensemble is a Group Communication System built at Cornell and the Hebrew 
universities. It allows processes to create process groups within which scalable 
reliable FIFO-ordered multicast and point-to-point communication are supported. The 
system also supports other communication properties, such as causal and total 
multicast ordering, flow control, and the like. This article describes the security 
protocols and infrastructure of Ensemble. Applications using Ensemble with the 
extensions des ... 

The running times of many computational science applications, such as 
protein-folding using ab initio methods, are much longer than the 
mean-time-to-failure of high-performance computing platforms. To run to 
completion, therefore, these applications must tolerate hardware failures. In this 
paper, we focus on the stopping failure model in which a faulty process hangs and 
stops responding to the rest of the system. We argue that tolerating such faults is 
best done by an approach called appl ... 

We present the design, implementation, and evaluation of run-time adaptation 
within the River dataflow programming environment. The goal of the River system 
is to provide adaptive mechanisms that allow database query-processing 
applications to cope with performance variations that are common in cluster 
platforms. We describe the system and its basic mechanisms, and carefully 
evaluate those mechanisms and their effectiveness. In our analysis, we answer four 
previously unanswered and important que ... 

The demand for streaming multimedia applications is growing at an incredible rate. 
In this paper, we propose Bayeux, an efficient application-level multicast system 
that scales to arbitrarily large receiver groups while tolerating failures in routers 
and network links. Bayeux also includes specific mechanisms for load-balancing 
across replicate root nodes and more efficient bandwidth consumption. Our 
simulation results indicate that Bayeux maintains these properties while keeping 
transmi ... 

Parallel workstations, each comprising tens of processors based on shared memory, 
promise cost-effective scalable multiprocessing. This article explores the coupling of 
such small- to medium-scale shared-memory multiprocessors through software 
over a local area network to synthesize larger shared-memory systems. We call 
these systems Distributed Shared-memory Multiprocessors (DSMPs). This article 
introduces the design of a shared-memory system that uses multiple granularities 
of sharing, ca ... 

When distributed systems first appeared, they were programmed in traditional 
sequential languages, usually with the addition of a few library procedures for 
sending and receiving messages. As distributed applications became more 
commonplace and more sophisticated, this ad hoc approach became less 
satisfactory. Researchers all over the world began designing new programming 
languages specifically for implementing distributed applications. These languages 
and their history, their underlying pr ... 

In this paper, we compare and contrast two techniques to improve capacity/conflict 
miss traffic in CC-NUMA DSM clusters. Page migration/replication optimizes 
read-write accesses to a page used by a single processor by migrating the page to 
that processor and replicates all read-shared pages in the sharers' local memories. 
R-NUMA optimizes read-write accesses to any page by allowing a processor to 
cache that page in its main memory. Page migration/replication requires less 
hardware c ... 

Virtual networks provide applications with the illusion of having their own 
dedicated, high-performance networks, although network interfaces possess limited, 
shared resources. We present the design of a large-scale virtual network system 
and examine the integration of communication programming interface, system 
resource management, and network interface operation. Our implementation on a 
cluster of 100 workstations quantifies the impact of virtualization on small message 
latencies and throughput ... 